This assignment applies Feature Engineering processes and Evaluation methods to a practical scenario: predicting house prices.
The dataset is provided in two separate files, one for the training set and one for the test set.
original_training_data = read.csv(file = file.path("dataset/house_price_train.csv"))
original_test_data = read.csv(file = file.path("dataset/house_price_test.csv"))
To avoid applying the Feature Engineering process twice (once for training and once for test), I will join both datasets, apply the FE, and then split them again. However, joining the two dataframes as they are raises an error because they do not have the same columns: test_data has no price column. Therefore, I first create this column in the test set and then join the data.
original_test_data$price <- 0
dataset <- as.data.table(rbind(original_training_data, original_test_data))
Let’s now summarize the dataset to see where to begin.
## id date price bedrooms
## Min. :1.000e+06 6/23/2014: 142 Min. : 0 Min. : 1.000
## 1st Qu.:2.123e+09 6/25/2014: 131 1st Qu.: 220000 1st Qu.: 3.000
## Median :3.905e+09 6/26/2014: 131 Median : 384000 Median : 3.000
## Mean :4.580e+09 7/8/2014 : 127 Mean : 431877 Mean : 3.373
## 3rd Qu.:7.309e+09 4/27/2015: 126 3rd Qu.: 580000 3rd Qu.: 4.000
## Max. :9.900e+09 3/25/2015: 123 Max. :7700000 Max. :33.000
## (Other) :20817
## bathrooms sqft_living sqft_lot floors
## Min. :0.500 Min. : 370 Min. : 520 Min. :1.000
## 1st Qu.:1.750 1st Qu.: 1430 1st Qu.: 5040 1st Qu.:1.000
## Median :2.250 Median : 1910 Median : 7618 Median :1.500
## Mean :2.116 Mean : 2080 Mean : 15099 Mean :1.494
## 3rd Qu.:2.500 3rd Qu.: 2550 3rd Qu.: 10685 3rd Qu.:2.000
## Max. :8.000 Max. :13540 Max. :1651359 Max. :3.500
##
## waterfront view condition grade
## Min. :0.000000 Min. :0.0000 Min. :1.00 Min. : 3.000
## 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:3.00 1st Qu.: 7.000
## Median :0.000000 Median :0.0000 Median :3.00 Median : 7.000
## Mean :0.007547 Mean :0.2343 Mean :3.41 Mean : 7.658
## 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:4.00 3rd Qu.: 8.000
## Max. :1.000000 Max. :4.0000 Max. :5.00 Max. :13.000
##
## sqft_above sqft_basement yr_built yr_renovated
## Min. : 370 Min. : 0.0 Min. :1900 Min. : 0.00
## 1st Qu.:1190 1st Qu.: 0.0 1st Qu.:1951 1st Qu.: 0.00
## Median :1560 Median : 0.0 Median :1975 Median : 0.00
## Mean :1789 Mean : 291.7 Mean :1971 Mean : 84.46
## 3rd Qu.:2210 3rd Qu.: 560.0 3rd Qu.:1997 3rd Qu.: 0.00
## Max. :9410 Max. :4820.0 Max. :2015 Max. :2015.00
##
## zipcode lat long sqft_living15
## Min. :98001 Min. :47.16 Min. :-122.5 Min. : 399
## 1st Qu.:98033 1st Qu.:47.47 1st Qu.:-122.3 1st Qu.:1490
## Median :98065 Median :47.57 Median :-122.2 Median :1840
## Mean :98078 Mean :47.56 Mean :-122.2 Mean :1987
## 3rd Qu.:98118 3rd Qu.:47.68 3rd Qu.:-122.1 3rd Qu.:2360
## Max. :98199 Max. :47.78 Max. :-121.3 Max. :6210
##
## sqft_lot15
## Min. : 651
## 1st Qu.: 5100
## Median : 7620
## Mean : 12758
## 3rd Qu.: 10083
## Max. :871200
##
In this section I am going to perform some data cleaning.
The features id and date are not going to offer any advantage for prediction, so we drop them.
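The drop itself is not shown in this excerpt; a minimal sketch of the idea, using a toy frame standing in for `dataset`, could be:

```r
# Toy stand-in for the merged dataset: assigning NULL removes a column
# (this works for both data.frame and data.table objects)
toy <- data.frame(id = c(1000001, 1000002),
                  date = c("6/23/2014", "6/25/2014"),
                  price = c(221900, 538000))
toy$id   <- NULL
toy$date <- NULL
names(toy)  # only "price" remains
```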
Going back to the summary of the dataset, we can identify some numerical features that are actually categories: we have to convert them to the proper ‘class’ or ‘type’.
dataset$condition <- factor(dataset$condition)
dataset$grade <- factor(dataset$grade)
dataset$zipcode <- factor(dataset$zipcode)
dataset$yr_built <- factor(dataset$yr_built)
dataset$yr_renovated <- factor(dataset$yr_renovated)
dataset$view <- factor(dataset$view)
# Let's now turn character columns into factors
dataset[ ,names(dataset)[sapply(dataset, is.character)]:=lapply(.SD,as.factor),
.SDcols = names(dataset)[sapply(dataset, is.character)]]
Let’s check if there are NAs in the dataset
#Counting columns with null values.
na.cols <- which(colSums(is.na(dataset)) > 0)
paste('There are', length(na.cols), 'columns with missing values')
## [1] "There are 0 columns with missing values"
We now focus on the numerical features. In this section I will detect numerical features that present outliers, and I will clip those values.
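`col_outliers_to_remove` is defined elsewhere in the assignment; one plausible way to build it, assuming the same 1.5 * IQR fence that the clipping loop below uses, is sketched here:

```r
# Hypothetical helper: does a numeric vector contain points beyond
# 1.5 * IQR from the quartiles?
has_outliers <- function(x) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = TRUE)
  h   <- 1.5 * IQR(x, na.rm = TRUE)
  any(x < qnt[1] - h | x > qnt[2] + h, na.rm = TRUE)
}
has_outliers(c(1, 2, 3, 4, 100))  # TRUE: 100 lies far above the upper fence
```

On the full dataset, `col_outliers_to_remove` would then be the names of the numeric columns for which `has_outliers` returns TRUE.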
# clipping the outliers
for (i in col_outliers_to_remove){
qnt <- quantile(dataset[[i]], probs=c(.25, .75))
caps <- quantile(dataset[[i]], probs=c(.05, .95))
H <- 1.5 * IQR(dataset[[i]])
dataset[[i]][dataset[[i]] < (qnt[1] - H)] <- caps[1]
dataset[[i]][dataset[[i]] > (qnt[2] + H)] <- caps[2]
}
# Plot the box plots again to verify that the procedure has been successful
for (i in col_outliers_to_remove){
boxplot(dataset[,i, with=FALSE], type="p", xlab=i)
}
We now need to detect skewness in the target variable. Let’s see the effect of skewness on a variable by plotting it with ggplot. To get rid of the skewness I will take the log of the feature’s values, which flattens the distribution.
We therefore transform the target variable by applying the log.
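The transformation itself is not shown in this excerpt; given the `exp(...) - 1` reversals used in the evaluation sections, it amounts to a log(x + 1) transform, sketched here on example values:

```r
# log(price + 1) flattens the right-skewed target;
# exp(pred) - 1 reverts the transform at prediction time
price <- c(221900, 538000, 7700000)
log_price <- log(price + 1)
all.equal(exp(log_price) - 1, price)  # TRUE: the transform is invertible
```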
The same “skewness” observed in the target variable also affects other variables. To facilitate the application of the regression models, we eliminate this skewness as well: for numeric features with excessive skewness, we perform a log transformation.
Now, let’s compute the skewness of each feature that is neither ‘factor’ nor ‘character’. NB: I will not include lat and long.
column_types <- sapply(names(dataset), function(x) {
class(dataset[[x]])
}
)
numeric_columns <- names(column_types[column_types != "factor"])
numeric_columns <- numeric_columns[numeric_columns != 'lat' & numeric_columns != 'long']
At this point, we need to calculate the skewness of each numerical column.
# skew of each variable
skew <- sapply(numeric_columns, function(x) {
e1071::skewness(dataset[[x]], na.rm = T)
}
)
To reduce the skewness across the dataset, we apply the log to those columns whose skewness exceeds a given threshold, which we set to 1.
# transform all variables above a threshold skewness
skewness_threshold <- 1  # threshold stated above
skew <- skew[abs(skew) > skewness_threshold]
for(x in names(skew)) {
dataset[[x]] <- log(dataset[[x]] + 1)
}
In this section I will create some new features to improve the predictive power of the dataset.
# having a look at the columns
colnames(dataset)
## [1] "price" "bedrooms" "bathrooms" "sqft_living"
## [5] "sqft_lot" "floors" "waterfront" "view"
## [9] "condition" "grade" "sqft_above" "sqft_basement"
## [13] "yr_built" "yr_renovated" "zipcode" "lat"
## [17] "long" "sqft_living15" "sqft_lot15"
# Creating a total square feet feature
dataset$TotalSqFeet <- as.numeric(dataset$sqft_basement + dataset$sqft_above + dataset$sqft_lot + dataset$sqft_living)
# Creating a Remodeling indicator (0 = No Remodeling, 1 = Remodeling)
dataset$Remod <- ifelse(dataset$yr_renovated != 0, 1, 0)
dataset$Remod <- factor(dataset$Remod)
# Creating a new feature for the total number of rooms in the house
dataset$rooms <- dataset$bedrooms + dataset$bathrooms
# Creating a new feature with the geodesic distance from the (0, 0) reference point
dataset$distance <- distGeo(as.matrix(dataset[,c('long','lat')]), c(0,0))
# We will aggregate into "others" those grade levels with less than 1% share
niche_grade<-names(which(summary(dataset$grade)/nrow(dataset)<0.01))
dataset[, grade_agg:=as.factor(ifelse(grade%in%niche_grade,'others',as.character(grade)))]
summary(dataset$grade)/nrow(dataset)
summary(dataset$grade_agg)/nrow(dataset)
sum(summary(dataset$grade_agg)/nrow(dataset))
dataset[, length(levels(grade_agg))]
dataset[, length(levels(grade))]
dataset[, length(levels(grade_agg))/length(levels(grade))-1] # important reduction in factor categories
dataset[, grade_agg:=factor(grade_agg, levels=names(sort(summary(dataset$grade_agg), dec=T)))]
# we drop the former grade variable
dataset <- dataset[, grade:=NULL]
# let's see the median price per zipcode and bin the zipcodes into 3 price categories
summary(dataset[price!= 0, price])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.26 12.68 13.02 13.05 13.38 15.86
plot(dataset[price != 0 ,list(price=median(price)), by = zipcode], type="p")
# Defining the 3 bins for zipcode
zip_price <- dataset[price != 0 ,list(price=median(price)), by = zipcode]
cheap_zip <- zip_price[price < 12.5, zipcode]
medium_zip <- zip_price[price >= 12.5 & price < 13.5, zipcode]
exp_zip <- zip_price[price >= 13.5, zipcode]
# Applying the bins
dataset$NeighBin[dataset$zipcode %in% exp_zip] <- 3
dataset$NeighBin[dataset$zipcode %in% medium_zip] <- 2
dataset$NeighBin[dataset$zipcode %in% cheap_zip] <- 1
# Factorizing the new feature
dataset$NeighBin <- factor(dataset$NeighBin)
To facilitate the data cleaning and feature engineering, we merged the train and test datasets. We now split them again to build our final model.
training_data <- dataset[which(dataset$price!=0),]
test <- dataset[which(dataset$price==0),]
We now split the annotated dataset into training and validation sets for the later evaluation of our regression models.
splits <- splitdf(training_data)
training <- splits$trainset
validation <- splits$testset
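`splitdf` is a helper defined elsewhere in the assignment; if it is unavailable, a minimal replacement, assuming a simple random 80/20 split, might look like:

```r
# Hypothetical splitdf: random 80/20 split into train and test subsets
splitdf <- function(df, train_fraction = 0.8, seed = 123) {
  set.seed(seed)
  idx <- sample(nrow(df), size = floor(train_fraction * nrow(df)))
  list(trainset = df[idx, , drop = FALSE],
       testset  = df[-idx, , drop = FALSE])
}
```

The actual `splitdf` may use a different ratio or stratification; this sketch only illustrates the shape of its return value.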
Let’s try first a baseline including all the features to evaluate the impact of the feature engineering.
lm.model(training, validation, "Baseline")
Let’s use information gain to select the most important variables. In this section I also use a cross-validation procedure to select the best threshold for my model. Since the results are better than the baseline, from now on I will use ig_training for all the models.
# applying information gain with the optimal threshold found
weights<- data.frame(information.gain(formula, training))
weights$feature <- rownames(weights)
weights[order(weights$attr_importance, decreasing = TRUE),]
## attr_importance feature
## zipcode 0.407883630 zipcode
## grade_agg 0.288890230 grade_agg
## lat 0.277114181 lat
## sqft_living 0.273301970 sqft_living
## TotalSqFeet 0.250527229 TotalSqFeet
## sqft_living15 0.218212778 sqft_living15
## sqft_above 0.195533477 sqft_above
## NeighBin 0.163604183 NeighBin
## distance 0.152663162 distance
## bathrooms 0.143773960 bathrooms
## rooms 0.128156052 rooms
## yr_built 0.066525710 yr_built
## sqft_lot15 0.061870353 sqft_lot15
## bedrooms 0.059315821 bedrooms
## floors 0.057253847 floors
## long 0.052752360 long
## view 0.048681269 view
## sqft_lot 0.048579780 sqft_lot
## sqft_basement 0.042102735 sqft_basement
## yr_renovated 0.018447617 yr_renovated
## condition 0.008416930 condition
## waterfront 0.007289474 waterfront
## Remod 0.005333362 Remod
information_gain_features <- weights$feature[weights$attr_importance > information_gain_CV(training, validation,
int_min = 0.003,
int_max = 0.009,
steps = 0.001)]
ig_training <- training[,c(information_gain_features,"price"),with=FALSE]
ig_test <- test[,c(information_gain_features,"price"),with=FALSE]
ig_validation <- validation[,c(information_gain_features,"price"),with=FALSE]
lm.model(ig_training, ig_validation, "Information Gain")
Let’s now try Ridge Regression.
lambdas <- 10^seq(-3, 0, by = .05)
set.seed(121)
train_control_config <- trainControl(method = "repeatedcv",
number = 5,
repeats = 1,
returnResamp = "all")
ridge.mod <- train(formula, data = ig_training,
method = "glmnet",
metric = "RMSE",
trControl=train_control_config,
tuneGrid = expand.grid(alpha = 0, lambda = lambdas))
Plotting the RMSE for the different lambda values, we can see the impact of this parameter on model performance. Small values seem to work better for this dataset.
plot(ridge.mod)
Plotting the coefficients for different lambda values. As expected, the larger the lambda (lower norm), the smaller the feature coefficients. However, as we can see at the top of the plot, there is no feature selection; i.e., the model always considers all 277 parameters.
plot(ridge.mod$finalModel)
# evaluating the model
ridge.mod.pred <- predict(ridge.mod, ig_validation)
ridge.mod.pred[is.na(ridge.mod.pred)] <- 0
my_data <- as.data.frame(cbind(predicted=(exp(ridge.mod.pred) -1), observed=(exp(validation$price) -1)))
ridge.mod.mape <- mape(ridge.mod.pred, validation$price)
ridge.mod.price_error <- mean(abs((exp(ridge.mod.pred) -1) - (exp(validation$price) -1)))
# plotting the results
ggplot(my_data, aes(predicted, observed)) +
geom_point() + geom_smooth(method = "glm") +
labs(x="Predicted") +
ggtitle(paste("Ridge", ' MAPE: ', format(round(ridge.mod.mape, 4), nsmall=4), ' --> Price ERROR:', format(round(ridge.mod.price_error, 0), nsmall=0),
'€', sep='')) +
scale_x_continuous(labels = scales::comma) +
scale_y_continuous(labels = scales::comma)
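`mape` is another helper defined elsewhere; the calls above and in the later evaluations suggest the signature `mape(prediction, actual)`. A minimal sketch, assuming the common percentage convention (whether the helper multiplies by 100 is an assumption):

```r
# Hypothetical mape(prediction, actual): mean absolute percentage error
mape <- function(pred, actual) {
  mean(abs((actual - pred) / actual)) * 100
}
mape(c(95, 210), c(100, 200))  # both predictions are 5% off -> 5
```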
Rank the variables according to the importance attributed by the model.
# Print, plot variable importance
plot(varImp(ridge.mod), top = 20) # 20 most important features
Now let’s try Lasso in combination with information gain. Moreover, I will use cross-validation to identify the best seed value.
# applying the lasso regression with the optimal seed value found above.
lambdas <- 10^seq(-3, 0, by = .05)
set.seed(lasso_CV(training, validation,20,60,10))
train_control_config <- trainControl(method = "repeatedcv",
number = 5,
repeats = 1,
returnResamp = "all")
lasso.mod <- train(formula, data = ig_training,
method = "glmnet",
metric = "RMSE",
trControl=train_control_config,
tuneGrid = expand.grid(alpha = 1, lambda = lambdas))
Plotting the RMSE for the different lambda values, we can see the impact of this parameter on model performance. Small values seem to work better for this dataset.
plot(lasso.mod)
Plotting the coefficients for different lambda values.
plot(lasso.mod$finalModel)
# evaluating the model
lasso.mod.pred <- predict(lasso.mod, ig_validation)
lasso.mod.pred[is.na(lasso.mod.pred)] <- 0
my_data <- as.data.frame(cbind(predicted=(exp(lasso.mod.pred) -1), observed=(exp(validation$price) -1)))
lasso.mod.mape <- mape(lasso.mod.pred, validation$price)
lasso.mod.price_error <- mean(abs((exp(lasso.mod.pred) -1) - (exp(validation$price) -1)))
# plotting the results
ggplot(my_data, aes(predicted, observed)) +
geom_point() + geom_smooth(method = "glm") +
labs(x="Predicted") +
ggtitle(paste("Lasso", ' MAPE: ', format(round(lasso.mod.mape, 4), nsmall=4), ' --> Price ERROR:', format(round(lasso.mod.price_error, 0), nsmall=0),
' €', sep='')) +
scale_x_continuous(labels = scales::comma) +
scale_y_continuous(labels = scales::comma)
Rank the variables according to the importance attributed by the model.
# Print, plot variable importance
plot(varImp(lasso.mod), top = 20) # 20 most important features
Now let’s try Random Forest. After running all the models with the default parameters, XGB was the best performing, so I will apply hyper-parameter tuning only to XGB.
# defining the model
rf_0<-ranger(formula=formula, data=ig_training)
rf.pred<-predict(rf_0, data = ig_validation)$predictions
# evaluation
rf.mape <- mape(rf.pred, validation$price)
rf.price_error <- mean(abs((exp(rf.pred) -1) - (exp(validation$price) -1)))
my_data <- as.data.frame(cbind(predicted=(exp(rf.pred) -1), observed=(exp(validation$price) -1)))
# plotting the results
ggplot(my_data, aes(predicted, observed)) +
geom_point() + geom_smooth(method = "glm") +
labs(x="Predicted") +
ggtitle(paste("Random Forest", ' MAPE: ', format(round(rf.mape, 4), nsmall=4), ' --> Price ERROR:', format(round(rf.price_error, 0), nsmall=0),
' €', sep='')) +
scale_x_continuous(labels = scales::comma) +
scale_y_continuous(labels = scales::comma)
In this section we’ll use XGBoost. First, we have to transform the data so that XGBoost can work with it: we need to create dummy variables and cast the data into a matrix. Furthermore, I will apply information gain to optimize the model.
xgb_training <- caret::dummyVars(formula= ~., data = training, fullRank=T,sep = "_")
xgb_training <- data.table(predict(xgb_training, newdata = training))
xgb_bench_training <- as.matrix(xgb_training[, !'price', with=F])
xgb_validation <- caret::dummyVars(formula= ~., data = validation, fullRank=T,sep = "_")
xgb_validation <- data.table(predict(xgb_validation, newdata = validation))
xgb_bench_validation <- as.matrix(xgb_validation[, !'price', with=F])
xgb_test <- caret::dummyVars(formula= ~., data = test, fullRank=T,sep = "_")
xgb_test <- data.table(predict(xgb_test, newdata = test))
xgb_bench_test <- as.matrix(xgb_test[, !'price', with=F])
# applying information gain with the optimal threshold found
weights<- data.frame(information.gain(formula, xgb_training))
weights$feature <- rownames(weights)
weights[order(weights$attr_importance, decreasing = TRUE),]
information_gain_features <- weights$feature[weights$attr_importance > information_gain_xgb_CV(xgb_training,
xgb_validation,
xgb_training$price,
xgb_validation$price,
int_min = 0.001,
int_max = 0.009,
steps = 0.001)]
xgb_ig_training <- xgb_training[,c(information_gain_features,"price"),with=FALSE]
xgb_ig_training <- as.matrix(xgb_ig_training[, !'price', with=F])
xgb_ig_test <- xgb_test[,c(information_gain_features,"price"),with=FALSE]
xgb_ig_test <- as.matrix(xgb_ig_test[, !'price', with=F])
xgb_ig_validation <- xgb_validation[,c(information_gain_features,"price"),with=FALSE]
xgb_ig_validation <- as.matrix(xgb_ig_validation[, !'price', with=F])
Now let’s define our first model using all the default parameters.
# defining the model
xgb_0<-xgboost(booster='gbtree',
data=xgb_bench_training,
label= training$price,
nrounds = 100,
objective='reg:linear')
xgb.pred<-predict(xgb_0, newdata = xgb_bench_validation, type='response')
# evaluation
xgb.mape <- mape(xgb.pred, validation$price)
xgb.price_error <- mean(abs((exp(xgb.pred) -1) - (exp(validation$price) -1)))
my_data <- as.data.frame(cbind(predicted=(exp(xgb.pred) -1), observed=(exp(validation$price) -1)))
# Plotting the results
ggplot(my_data, aes(predicted, observed)) +
geom_point() + geom_smooth(method = "glm") +
labs(x="Predicted") +
ggtitle(paste("XGBoost", ' MAPE: ', format(round(xgb.mape, 4), nsmall=4), ' --> Price ERROR:', format(round(xgb.price_error, 0), nsmall=0),
' €', sep='')) +
scale_x_continuous(labels = scales::comma) +
scale_y_continuous(labels = scales::comma)
To improve performance, I will use a grid-search function to find the optimal hyperparameter values.
# finding the best combination of hyperparameters
optimized_param <- GridSerarch_xgb(xgb_ig_training, training$price)
Now we will define the model with the best hyperparameters.
# Defining the model
xgb_opt<-xgboost(booster='gbtree',
data=xgb_ig_training,
label= training$price,
nrounds = 100,
max.depth = optimized_param[,'Depth'],
eta = optimized_param[,'eta'],
subsample = optimized_param[,'SubSampRate'],
colsample_bytree = optimized_param[,'ColSampRate'],
min_child_weight = 1,
objective='reg:linear')
xgb.pred.opt<-predict(xgb_opt, newdata = xgb_ig_validation, type='response')
# evaluation
xgb.mape <- mape(xgb.pred.opt, validation$price)
xgb.price_error <- mean(abs((exp(xgb.pred.opt) -1) - (exp(validation$price) -1)))
my_data <- as.data.frame(cbind(predicted=(exp(xgb.pred.opt) -1), observed=(exp(validation$price) -1)))
# plotting the results
ggplot(my_data, aes(predicted, observed)) +
geom_point() + geom_smooth(method = "glm") +
labs(x="Predicted") +
ggtitle(paste("XGBoost", ' MAPE: ', format(round(xgb.mape, 4), nsmall=4), ' --> Price ERROR:', format(round(xgb.price_error, 0), nsmall=0),
' €', sep='')) +
scale_x_continuous(labels = scales::comma) +
scale_y_continuous(labels = scales::comma)
In this section we’ll apply regression with stepwise feature selection.
# adapting the dataset to work for the algorithm
lm_ig_training <- ig_training
lm_ig_training$yr_built <- as.numeric(lm_ig_training$yr_built)
lm_ig_training$yr_renovated <- as.numeric(lm_ig_training$yr_renovated)
lm_ig_validation <- ig_validation
lm_ig_validation$yr_built <- as.numeric(lm_ig_validation$yr_built)
lm_ig_validation$yr_renovated <- as.numeric(lm_ig_validation$yr_renovated)
#### Regression with StepWise feature selection
lm_0<-stepAIC(lm(formula = formula,
data=lm_ig_training),
trace=F)
lm.pred<-predict(lm_0, newdata = lm_ig_validation)
# evaluation
lm.mape <- mape(lm.pred, validation$price)
lm.price_error <- mean(abs((exp(lm.pred) -1) - (exp(validation$price) -1)))
my_data <- as.data.frame(cbind(predicted=(exp(lm.pred) -1), observed=(exp(validation$price) -1)))
# plotting the results
ggplot(my_data, aes(predicted, observed)) +
geom_point() + geom_smooth(method = "glm") +
labs(x="Predicted") +
ggtitle(paste("StepWise", ' MAPE: ', format(round(lm.mape, 4), nsmall=4), ' --> Price ERROR:', format(round(lm.price_error, 0), nsmall=0),
' €', sep='')) +
scale_x_continuous(labels = scales::comma) +
scale_y_continuous(labels = scales::comma)
In this section we’ll apply regression with regularization.
# Defining the model
glmnet_cv<-cv.glmnet(x = xgb_ig_training,
nfolds = 5,
y = training[['price']],
alpha=1,
family = 'gaussian',
standardize = T)
plot.cv.glmnet(glmnet_cv)
glmnet_cv$lambda.min
## [1] 0.0003057683
glmnet_0<-glmnet(x = xgb_ig_training,
y = training[['price']],
family = 'gaussian',
alpha=1, lambda = glmnet_cv$lambda.min)
glmnet.pred <- predict(glmnet_0, newx = xgb_ig_validation)
# Evaluation
glmnet.mape <- mape(glmnet.pred, validation$price)
glmnet.price_error <- mean(abs((exp(glmnet.pred) -1) - (exp(validation$price) -1)))
my_data <- as.data.frame(cbind(predicted=(exp(glmnet.pred) -1), observed=(exp(validation$price) -1)))
# Plotting the results
ggplot(my_data, aes(s0, observed)) +
geom_point() + geom_smooth(method = "glm") +
labs(x="Predicted") +
ggtitle(paste("Regression with regularization", ' MAPE: ', format(round(glmnet.mape, 4), nsmall=4), ' --> Price ERROR:', format(round(glmnet.price_error, 0), nsmall=0),
' €', sep='')) +
scale_x_continuous(labels = scales::comma) +
scale_y_continuous(labels = scales::comma)
We split the original training data into train and validation sets to evaluate the candidate models. To generate the final submission we instead take all the data at our disposal.
In addition, we applied a log transformation to the target variable; to revert this transformation we use the exp function.
For my final prediction I tried all the combinations of the models explained above. The best model is the optimized XGBoost.
# Use the optimized model trained above
final.model <- xgb_opt
# Predict prices for the test data (using exp to revert the log transformation applied to the target variable)
final.pred <- as.numeric(exp(predict(final.model, xgb_ig_test, type='response'))-1)
final.pred[is.na(final.pred)]
## numeric(0)
hist(final.pred, main="Histogram of Predictions", xlab = "Predictions")
xgb_submission <- data.frame(Id = original_test_data$id, price= (final.pred))
colnames(xgb_submission) <-c("Id", "price")
write.csv(xgb_submission, file = "submission1.csv", row.names = FALSE)